Prior to this lecture, you should have read chapter 4 of Regression and Other Stories.
I have a data set of commute times based on a random sample of 100 California households. Based on that sample, I can calculate an average commute time.
mean(commutes_100a$TRANTIME)
## [1] 31.1
But if I had sampled a different set of 100 households from the same population, I could have gotten a slightly different average.
mean(commutes_100b$TRANTIME)
## [1] 28.21
mean(commutes_100c$TRANTIME)
## [1] 27.01
mean(commutes_100d$TRANTIME)
## [1] 31.99
All of these averages will tend to be clustered around the actual population average, even if none of them will be exactly equal to the population average.
A one-sample t-test uses the mean and standard deviation of a sample to calculate a confidence interval for the population mean: a range of values that the real average of the population probably falls within.
Here is how you would get a 90-percent confidence interval for the average commute time in R.
t.test(commutes_100a$TRANTIME, conf.level = 0.9)
##
## One Sample t-test
##
## data: commutes_100a$TRANTIME
## t = 11.863, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 26.7473 35.4527
## sample estimates:
## mean of x
## 31.1
Look at the two values listed under 90 percent confidence interval:. You can interpret them to mean that you can be 90 percent confident that the average commute time for the full population is between 26.7 and 35.5 minutes.
You can also calculate a confidence interval for the population mean in Excel, but you’ll need to do it in a couple of steps.
First, you would calculate the standard deviation and average of your sample data. Then, use the CONFIDENCE.T() function to calculate the margin for the confidence interval. It takes three arguments: alpha (one minus the confidence level), the standard deviation, and the sample size (in the example below, I use the COUNT() function to get the sample size). The confidence interval is the sample mean, plus or minus this margin.
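You can mirror Excel's CONFIDENCE.T() calculation in R. This is a sketch using a small made-up sample (the vector `x` is invented for illustration; any column of numbers would work the same way).

```r
# Made-up sample of ten commute times, in minutes.
x <- c(25, 32, 18, 44, 29, 35, 21, 38, 27, 31)

conf_level <- 0.9
alpha <- 1 - conf_level          # Excel's first argument to CONFIDENCE.T()
n <- length(x)                   # sample size, like COUNT() in Excel

# The margin: a t critical value times the standard error of the mean.
margin <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)

# The confidence interval is the sample mean, plus or minus the margin.
c(mean(x) - margin, mean(x) + margin)
```

This reproduces exactly what `t.test(x, conf.level = 0.9)` reports as its confidence interval.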
Three things influence the width of a confidence interval:

- The sample size: larger samples produce narrower intervals.
- The variability of the data: a larger standard deviation produces a wider interval.
- The confidence level: higher confidence (for example, 99 percent rather than 90 percent) requires a wider interval.
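You can see these influences directly by comparing interval widths for simulated data (the samples below are made up with `rnorm()` purely for illustration).

```r
# Full width of a t-based confidence interval.
ci_width <- function(x, conf_level) {
  diff(t.test(x, conf.level = conf_level)$conf.int)
}

set.seed(1)
small_sample <- rnorm(25, mean = 30, sd = 10)   # few observations
large_sample <- rnorm(400, mean = 30, sd = 10)  # many observations

ci_width(small_sample, 0.9)   # widest: small sample
ci_width(large_sample, 0.9)   # narrower: larger sample
ci_width(large_sample, 0.99)  # wider again: more confidence demanded
```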
You can also use a one-sample t-test to calculate the confidence interval for the proportion of the population that falls into a category. Here is how I would find the 90 percent confidence interval for the proportion of the population that commutes by car.
t.test(commutes_100a$mode == "Car", conf.level = 0.9)
##
## One Sample t-test
##
## data: commutes_100a$mode == "Car"
## t = 28.302, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 0.8377863 0.9422137
## sample estimates:
## mean of x
## 0.89
This result means I can be 90 percent confident that the share of the full population that commutes by car is between 84 percent and 94 percent.
I can get a similar result in Excel.
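The t-test-on-a-category trick works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is just the sample proportion. Here is a sketch with a made-up vector that has the same 89-out-of-100 split as the sample above.

```r
# Made-up sample: 89 car commuters out of 100.
is_car <- c(rep(TRUE, 89), rep(FALSE, 11))

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion: 0.89.
mean(is_car)

# The same t-interval, computed on the 0/1 values.
t.test(is_car, conf.level = 0.9)$conf.int
```

Because this vector has the same proportion and sample size, it reproduces the interval from about 0.84 to 0.94 shown above.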
You can use group_by() and get_summary_stats() in R to produce a table that gives an average value within each category, along with the 95-percent confidence interval for each average.
library(rstatix)
library(knitr)       # for kable()
library(kableExtra)  # for scroll_box()
income_by_mode <- commuting %>%
group_by(mode) %>%
get_summary_stats(INCTOT, type = "mean_ci") %>%
mutate(ci_low = mean - ci,
ci_hi = mean + ci)
income_by_mode %>%
kable(digits = c(rep(0, 15), 3, 0)) %>%
scroll_box(width = "75%")
| mode | variable | n | mean | ci | ci_low | ci_hi |
|---|---|---|---|---|---|---|
| Bike | INCTOT | 8390 | 73987 | 2049 | 71938 | 76037 |
| Car | INCTOT | 715885 | 68652 | 198 | 68454 | 68850 |
| Other | INCTOT | 13682 | 69853 | 1633 | 68220 | 71487 |
| Transit | INCTOT | 42950 | 70218 | 834 | 69384 | 71053 |
| Walk | INCTOT | 23893 | 47564 | 963 | 46601 | 48528 |
In the table above, the 95-percent confidence interval for the average income of those who bike to work is $71,938 to $76,037. In other words, we can be 95-percent confident that the average income for all cyclists in the full population is within that range.
Error bars can be a helpful way to visualize these confidence intervals.
ggplot(income_by_mode) +
geom_col(aes(x = mode, y = mean)) +
geom_errorbar(aes(x = mode,
ymin = ci_low,
ymax = ci_hi),
width = 0.2) +
scale_y_continuous(name = "Average income",
breaks = breaks <- seq(0, 90000, by = 10000),
labels = paste0("$",
prettyNum(breaks,
big.mark = ","))) +
scale_x_discrete(name = "Usual mode of travel to work") +
theme_minimal()
If the population average within a category is a range rather than a single number, how do you compare the averages between two groups?
A two-sample t-test can tell us if there is a statistically significant difference in the averages between two categories.
A statistically significant difference means we can have an acceptable level of confidence (usually 95 percent confidence) that the two averages are not the same.
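Before running every pairwise comparison, it can help to see a single two-sample test in isolation. This is a sketch with made-up incomes for two hypothetical groups; the base-R `t.test()` function accepts two vectors directly.

```r
# Made-up incomes for two hypothetical groups of commuters.
set.seed(2)
group_a <- rnorm(200, mean = 70000, sd = 20000)
group_b <- rnorm(200, mean = 60000, sd = 20000)

# A p-value below 0.05 indicates a statistically significant difference
# in means at the 95 percent confidence level.
t.test(group_a, group_b)
```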
Here is how you would calculate the difference in average income between all possible pairs of mode categories.
library(rstatix)
comp_income_by_mode <- commuting %>%
t_test(INCTOT ~ mode, detailed = TRUE, conf.level = 0.9)
comp_income_by_mode %>%
kable(digits = c(rep(0, 15), 3, 0)) %>%
scroll_box(width = "75%")
| estimate | estimate1 | estimate2 | .y. | group1 | group2 | n1 | n2 | statistic | p | df | conf.low | conf.high | method | alternative | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5336 | 73987 | 68652 | INCTOT | Bike | Car | 8390 | 715885 | 5 | 0 | 8546 | 3608 | 7063 | T-test | two.sided | 0.000 | **** |
| 4134 | 73987 | 69853 | INCTOT | Bike | Other | 8390 | 13682 | 3 | 0 | 17986 | 1935 | 6333 | T-test | two.sided | 0.006 | ** |
| 3769 | 73987 | 70218 | INCTOT | Bike | Transit | 8390 | 42950 | 3 | 0 | 11339 | 1912 | 5626 | T-test | two.sided | 0.003 | ** |
| 26423 | 73987 | 47564 | INCTOT | Bike | Walk | 8390 | 23893 | 23 | 0 | 12298 | 24523 | 28323 | T-test | two.sided | 0.000 | **** |
| -1201 | 68652 | 69853 | INCTOT | Car | Other | 715885 | 13682 | -1 | 0 | 14085 | -2582 | 180 | T-test | two.sided | 0.304 | ns |
| -1567 | 68652 | 70218 | INCTOT | Car | Transit | 715885 | 42950 | -4 | 0 | 47906 | -2286 | -847 | T-test | two.sided | 0.002 | ** |
| 21087 | 68652 | 47564 | INCTOT | Car | Walk | 715885 | 23893 | 42 | 0 | 25947 | 20262 | 21913 | T-test | two.sided | 0.000 | **** |
| -365 | 69853 | 70218 | INCTOT | Other | Transit | 13682 | 42950 | 0 | 1 | 21286 | -1904 | 1174 | T-test | two.sided | 0.696 | ns |
| 22289 | 69853 | 47564 | INCTOT | Other | Walk | 13682 | 23893 | 23 | 0 | 23245 | 20697 | 23880 | T-test | two.sided | 0.000 | **** |
| 22654 | 70218 | 47564 | INCTOT | Transit | Walk | 42950 | 23893 | 35 | 0 | 55719 | 21585 | 23723 | T-test | two.sided | 0.000 | **** |
The correlation between two continuous variables is a measure of how closely their scatter plot resembles a straight line or how well the value of one variable can predict the value of the other. Correlations can range from negative 1 (a downward-sloping straight line) to positive 1 (an upward-sloping straight line).
A correlation of zero means there is no (linear) relationship between the two variables.
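A quick sketch with made-up numbers shows the two extremes: data that falls exactly on a line has a correlation of 1 (or -1 for a downward slope), and adding noise pulls the correlation toward zero.

```r
x <- 1:10
cor(x, 3 * x + 5)    # upward-sloping straight line: correlation is 1
cor(x, -2 * x + 40)  # downward-sloping straight line: correlation is -1

# Adding random noise weakens the linear relationship.
set.seed(3)
cor(x, 3 * x + 5 + rnorm(10, sd = 20))
```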
Remember that a variable with a log-normal distribution will have a lot of small values that are close together, and a few more spread-out larger values.
Here is a scatter plot of two log-normally distributed variables.
ggplot(commutes_5000) +
geom_point(aes(x = INCTOT, y = TRANTIME),
size = 0.1) +
theme_minimal()
And here is the same set of variables with the x- and y-axes on a log scale.
ggplot(commutes_5000) +
geom_point(aes(x = INCTOT, y = TRANTIME),
size = 0.1) +
scale_x_continuous(trans = "log") +
scale_y_continuous(trans = "log") +
theme_minimal()
You’ll find that the correlation between the two variables is less than the correlation between the logs of the two variables.
cor(commutes_5000$INCTOT,
commutes_5000$TRANTIME)
## [1] 0.08644618
cor(log(commutes_5000$INCTOT),
log(commutes_5000$TRANTIME))
## [1] 0.1613641
This means that there is a relationship between these two variables, but it isn’t a linear relationship.
Here’s a simpler (and more extreme) example. There is clearly a strong relationship between these two variables.
ggplot(square) +
geom_point(aes(x = X, y = Y)) +
theme_minimal()
But the correlation between X and Y in the plot above is zero.
cor(square$X, square$Y)
## [1] 0
I can transform X by squaring it.
ggplot(square) +
geom_point(aes(x = X^2, y = Y)) +
theme_minimal()
The correlation between X and Y was zero, but the correlation between the square of X and Y is 1.
cor(square$X^2,
square$Y)
## [1] 1
Just because there is a non-zero correlation between two variables in our sample, that doesn’t mean there would be a non-zero correlation between those variables for the full population. We can also calculate a confidence interval for a correlation.
cor.test(log(commutes_5000$INCTOT),
log(commutes_5000$TRANTIME))
##
## Pearson's product-moment correlation
##
## data: log(commutes_5000$INCTOT) and log(commutes_5000$TRANTIME)
## t = 11.557, df = 4996, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1342399 0.1882468
## sample estimates:
## cor
## 0.1613641